Packages to install

install.packages("twitteR")
install.packages("wordcloud2")
install.packages("tidyverse")
install.packages("tidytext")
install.packages("knitr")
install.packages("plotly")
devtools::install_github("ropenscilabs/icon") # to insert icons
devtools::install_github("hadley/emo") # to insert emoji
library(knitr)
library(magick)
## Linking to ImageMagick 6.9.9.39
## Enabled features: cairo, fontconfig, freetype, lcms, pango, rsvg, webp
## Disabled features: fftw, ghostscript, x11
library(png)
library(grid)
library(emo)
library(icon)
library(twitteR)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::id()       masks twitteR::id()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ dplyr::location() masks twitteR::location()

Reference material for the following tutorial

Downloading tweets with twitteR

Once you have connected to twitter and accessed your API, you are ready to download tweets. Let’s try to return the 3 most recent tweets that contain the #useR2018 hashtag.

> useR_tweets <- searchTwitter("#useR2018", n = 3, resultType = "recent")
> useR_tweets <- twListToDF(useR_tweets)
> # kable(useR_tweets[,1:3])

The searchTwitter function returns a list of status objects, here the 3 most recent tweets that contain #useR2018, together with their properties. The function twListToDF converts this list into a user-friendly data.frame with one row per tweet and columns such as text, created, favoriteCount and isRetweet.

This is a very simple search. Like any search engine, Twitter allows much more complicated searches using Boolean logic. For example, we can look for tweets that contain the #useR2018 hashtag and also mention #rstats. If you are unsure whether your query will work, you can first try it in the Twitter search box in your browser.

> useR_rstats_tweets <- searchTwitter("#useR2018 AND #rstats", n = 10, resultType = "recent")
> useR_rstats_tweets <- twListToDF(useR_rstats_tweets)
> kable(useR_rstats_tweets[3:5, ])

Here are some common examples of twitter search queries (none of these are case sensitive):

Search query          Tweets …
from:thomasp85        sent from Thomas Lin Pedersen
to:thomasp85          sent to Thomas Lin Pedersen
"R rules"             containing the exact phrase "R rules"
#rstats -#useR2018    containing #rstats but not #useR2018
#best OR #useR2018    containing #best or #useR2018 or both
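
The operators in the table can also be assembled programmatically. The sketch below uses only base R; build_query is a hypothetical helper introduced here for illustration and is not part of twitteR. It joins the wanted terms with OR and appends every excluded term with a leading minus sign.

```r
# Hypothetical helper (not part of twitteR): combine search terms into one query string.
build_query <- function(any_of, none_of = character(0)) {
  # terms joined with " OR " match tweets containing at least one of them
  query <- paste(any_of, collapse = " OR ")
  if (length(none_of) > 0) {
    # a leading "-" excludes tweets containing that term
    query <- paste(query, paste0("-", none_of, collapse = " "))
  }
  query
}

build_query(c("#best", "#useR2018"))           # "#best OR #useR2018"
build_query("#rstats", none_of = "#useR2018")  # "#rstats -#useR2018"
```

The resulting string can be passed as the first argument of searchTwitter. How Twitter resolves a query that mixes OR with exclusions is best checked in the browser first.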

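The searchTwitter function also allows you to specify some further restrictions on your searches, for example the language of the tweets (lang), a date range (since and until) or a location (geocode).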

Note that the time restrictions only work for dates within the last two weeks when you are connecting through a standard API (see discussion in the next section).

It is important to document your data collection process clearly, so it can be reproduced by other people. Mike Kearney, the author of the rtweet 📦, recently published a minimum list of information required to reproduce data collection from twitter. We have slightly modified his list here 😉:

Downloading tweets for a time series

With a standard API, Twitter limits users to downloading tweets from the last two weeks. Furthermore, the Twitter rate limits cap the number of search results returned at 18,000 every 15 minutes. This poses a problem when you are trying to collect large datasets of tweets or tweets from a longer time period. One way around this limitation is to schedule automatic downloads using cron jobs (the cronR 📦) on Linux/Unix or the Windows task scheduler (the taskscheduleR 📦) on Windows.

We wanted to download tweets relating to the Royal Wedding 👑 of Meghan Markle 👰 and Prince Harry 🤵, held on the 19th of May 2018 at Windsor Castle, UK. In order to obtain representative tweets from before, around, and after the time of the wedding, we collected 3,000 tweets every week from the 15th of March to the 31st of May (Thursday mornings, Australian time). We also collected 3,000 tweets every day for 12 days, beginning on the 17th of May and ending on the 28th of May. We only collected tweets containing any of the following hashtags:

> Hashtags <- read_csv("Hashtags.csv", col_names = TRUE)
> kable(Hashtags)
Hashtags
#MeghanMarkle
#TheRoyalWedding
#PrinceHarry
#HarryAndMeghan
#MeghanAndHarry
#RoyalWedding

The following is the exact search query we used in our cronjob (we are only going to explain how to work with cronjob here 😞). Note that we no longer request the most recent tweets: by leaving out resultType we instead ask for a mix of real-time and most popular tweets.

> rla_tweet <- searchTwitter(paste0(Hashtags$Hashtags, collapse = " OR "), n = 3000, 
+     lang = "en")

For the cronjob to work, we needed to set up an R script that contains all the authentication details, the hashtags and the query, and that saves the tweets into a .csv file labelled with the date of collection. This R script can then be scheduled to run as often as you like through the cronjob add-in. Just click on the Addins button in RStudio and find cronjob, where you can interactively schedule your script to run.
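
The saving step of such a script could look roughly like the sketch below. It is not the script we ran: authentication and the search itself are left out, tweets_df is a made-up stand-in for the data frame returned by twListToDF, and a temporary folder is assumed in place of the real output folder. Only base R is used.

```r
# Stand-in for the result of twListToDF(searchTwitter(...)); the rows are made up.
tweets_df <- data.frame(
  text = c("Example tweet #RoyalWedding", "Another example #HarryAndMeghan"),
  isRetweet = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Assumption: a temporary folder is used here; a scheduled script would use a fixed folder.
out_dir <- file.path(tempdir(), "Twitter_Data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Label the file with the date of collection, e.g. 2018-05-17.csv
out_file <- file.path(out_dir, paste0(format(Sys.Date(), "%Y-%m-%d"), ".csv"))

# write.csv() also writes the row names as an unnamed first column
write.csv(tweets_df, file = out_file)

file.exists(out_file)
```

Because write.csv stores the row names in an unnamed first column, reading the file back gives one extra column; this may be where the X1 column removed later in this tutorial comes from, although that is a guess.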

Analyse the 👑 tweets!

Step 1: Load tweets into R

  • Load the tidy tools 📦.
> library(tidyverse)
> library(tidytext)
  • Load the time series of royal 👑 tweets.

Here we load the .csv files containing the royal tweets downloaded weekly and coerce them into a data.frame. Every .csv file was downloaded with the twitteR 📦 and saved as described above.

The Twitter_Data 📁 contains the time series of royal 👑 tweets:

> data <- list.files(path = file.path("Twitter_Data"), pattern = "\\.csv$", full.names = TRUE) # only files ending in .csv
> data
##  [1] "Twitter_Data/2018-03-15.csv"     "Twitter_Data/2018-03-22.csv"    
##  [3] "Twitter_Data/2018-03-29.csv"     "Twitter_Data/2018-04-05.csv"    
##  [5] "Twitter_Data/2018-04-12.csv"     "Twitter_Data/2018-04-19.csv"    
##  [7] "Twitter_Data/2018-04-26.csv"     "Twitter_Data/2018-05-03.csv"    
##  [9] "Twitter_Data/2018-05-10.csv"     "Twitter_Data/2018-05-17_day.csv"
## [11] "Twitter_Data/2018-05-17.csv"     "Twitter_Data/2018-05-18_day.csv"
## [13] "Twitter_Data/2018-05-19_day.csv" "Twitter_Data/2018-05-20_day.csv"
## [15] "Twitter_Data/2018-05-21_day.csv" "Twitter_Data/2018-05-22_day.csv"
## [17] "Twitter_Data/2018-05-23_day.csv" "Twitter_Data/2018-05-24_day.csv"
## [19] "Twitter_Data/2018-05-24.csv"     "Twitter_Data/2018-05-25_day.csv"
## [21] "Twitter_Data/2018-05-26_day.csv" "Twitter_Data/2018-05-27_day.csv"
## [23] "Twitter_Data/2018-05-28_day.csv" "Twitter_Data/2018-05-29_day.csv"
## [25] "Twitter_Data/2018-05-31.csv"
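
The file names themselves carry information that can be recovered with base R. The sketch below parses the collection date from a few of the names listed above. It assumes that files ending in _day come from the daily collection and all others from the weekly one; this reading of the suffix is our interpretation for the example, and the column names are made up.

```r
# A few of the file names listed above
files <- c("Twitter_Data/2018-05-10.csv",
           "Twitter_Data/2018-05-17_day.csv",
           "Twitter_Data/2018-05-17.csv")

# Strip the folder and the extension, e.g. "2018-05-17_day"
stems <- sub("\\.csv$", "", basename(files))

collections <- data.frame(
  file = files,
  # assumption: "_day" marks the daily collection, everything else is weekly
  schedule = ifelse(grepl("_day$", stems), "daily", "weekly"),
  # the first 10 characters of the stem are the collection date
  collected = as.Date(substr(stems, 1, 10)),
  stringsAsFactors = FALSE
)
collections
```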

We first read every .csv file into an element of a list and then combine the elements into a single data.frame.

> # Read all the .csv files into R and save each of them as an element of a
> # list
> all_tweets <- lapply(data, function(tweets) {
+     data_tweets <- read_csv(tweets)
+     return(data_tweets)
+ })
> 
> # Combine the list elements into one data.frame
> all_tweets <- do.call(rbind, all_tweets)
> head(all_tweets)[, c(2:5)]
## # A tibble: 6 x 4
##   text                                   favorited favoriteCount replyToSN
##   <chr>                                  <lgl>             <int> <chr>    
## 1 RT @acpfonline: This is a wind up sur… FALSE                 0 <NA>     
## 2 RT @esricanada: Discover details abou… FALSE                 0 <NA>     
## 3 RT @chrisshipitv: While the Royal Fam… FALSE                 0 <NA>     
## 4 There should be no taxpayer funding f… FALSE                 0 <NA>     
## 5 Doing some homework for the #RoyalWed… FALSE                 0 <NA>     
## 6 #AndForWhatItsWorth I’m still …        FALSE                 0 <NA>

Tweets were downloaded from 2018-03-08 14:01:12 to 2018-05-31 01:45:37.

The tweets are now stored as a character vector in the column text of the data frame all_tweets.

The tidytext 📦 philosophy, extensively explained in Text mining with R, consists of having a table with one token per row. A token is defined as a meaningful unit of text. In the simplest case it is a single word, but it can also be a pair or triplet of consecutive words, and so on.
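
To make the idea of a token concrete, here is a minimal base R sketch on one made-up sentence. It only illustrates the one-token-per-row layout; the tidytext function used in the next step handles punctuation and other details that this sketch ignores.

```r
tweet <- "The wedding in this video was better"

# One token = one word (lower-cased, split on whitespace)
words <- strsplit(tolower(tweet), "\\s+")[[1]]
data.frame(tweetID = 1, word = words)

# One token = two consecutive words (bigrams):
# pair every word except the last with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
data.frame(tweetID = 1, word = bigrams)
```

The first table has seven rows (one per word) and the second has six (one per pair of neighbouring words).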

Step 2: Convert tweets to the tidy text format: tidytext::unnest_tokens

The main tidytext 📦 function which will do this for us is unnest_tokens. Let’s look at a quick example to see how it works!

Create one-token-per-row data frame

> all_tweets %>%
+   filter(!duplicated(all_tweets)) %>% # remove duplicated tweets
+   mutate(tweetID = 1:n()) %>% # set a TweetID column, X1 was too generic
+   select(-X1) %>%
+   unnest_tokens(output = word, input = text, token = "words") %>% # convert tweet in the text column to token = words
+   select(tweetID,word) %>%
+   count(word)
## # A tibble: 57,332 x 2
##    word                                      n
##    <chr>                                 <int>
##  1 ____                                      3
##  2 _____                                     1
##  3 ______________                           12
##  4 ___________________                       1
##  5 _____________________                     3
##  6 _____________________________________     1
##  7 ___fittaymuu                              2
##  8 ___mkc___                                 5
##  9 ___q__                                    1
## 10 __christan                                3
## # ... with 57,322 more rows

There is a lot of useless text in here 😱 !!

Create one-token-per-row data frame where each token is two consecutive words

> all_tweets %>% filter(!duplicated(all_tweets)) %>% mutate(tweetID = 1:n()) %>% 
+     select(-X1) %>% unnest_tokens(output = word, input = text, token = "ngrams", 
+     n = 2) %>% select(tweetID, word)
## # A tibble: 1,801,795 x 2
##    tweetID word       
##      <int> <chr>      
##  1   72068 _ajayv omg 
##  2   72068 omg the    
##  3   72068 the wedding
##  4   72068 wedding in 
##  5   72068 in this    
##  6   72068 this video 
##  7   72068 video was  
##  8   72068 was better 
##  9   72068 better than
## 10   72068 than the   
## # ... with 1,801,785 more rows

Looks better but we will do some more tweaks to make it more usable later!


With the code below, we first create a new column retweet_from in the data frame all_tweets using mutate(retweet_from = str_extract(string = text, pattern = "^RT @(\\w+)")). For every retweet, this column will contain the Twitter handle of the user who was retweeted; for original tweets it is NA. This is accomplished with the str_extract function from the stringr 📦, which extracts the pattern “RT @user_handle” (in regular expression form ^RT @(\\w+)) from the start of every retweet. Finally, mutate(retweet_from = str_replace_all(retweet_from, "^RT ", "")) removes the leading RT so that only the user handle is left in the retweet_from column.

> reg_retweets <- "^RT @(\\w+)"
> all_tweets <- all_tweets %>% mutate(tweetID = 1:n()) %>% select(-X1) %>% mutate(retweet_from = str_extract(text, 
+     reg_retweets)) %>% mutate(retweet_from = str_replace_all(retweet_from, "^RT ", 
+     ""))
> all_tweets[, c("tweetID", "retweet_from", "isRetweet")]
## # A tibble: 75,000 x 3
##    tweetID retweet_from     isRetweet
##      <int> <chr>            <lgl>    
##  1       1 @acpfonline      TRUE     
##  2       2 @esricanada      TRUE     
##  3       3 @chrisshipitv    TRUE     
##  4       4 <NA>             FALSE    
##  5       5 <NA>             FALSE    
##  6       6 <NA>             FALSE    
##  7       7 @chrisshipitv    TRUE     
##  8       8 @chrisshipitv    TRUE     
##  9       9 @Raine_Miller    TRUE     
## 10      10 @Nova_Magazine17 TRUE     
## # ... with 74,990 more rows
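
The same extraction can be traced step by step with base R, which may help to see what the regular expression does. The tweets below are made up (the handle @someone is hypothetical), and the third one shows that RT in the middle of a tweet is not picked up because the pattern is anchored with ^.

```r
tweets <- c("RT @chrisshipitv: While the Royal Family prepares",
            "Doing some homework for the #RoyalWedding",
            "Great RT @someone: not at the start")

reg_retweets <- "^RT @(\\w+)"

# regexpr() returns -1 where the pattern does not match
m <- regexpr(reg_retweets, tweets)
retweet_from <- rep(NA_character_, length(tweets))
retweet_from[m != -1] <- regmatches(tweets, m)  # the matched text, "RT @chrisshipitv"

# Drop the leading "RT " so that only the handle remains; NA stays NA
retweet_from <- sub("^RT ", "", retweet_from)
retweet_from
```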

Now let’s convert everything to the one-token-per-row format 🌟!

The code below will take the data frame all_tweets and convert it into a tidy text format. At this stage we do not filter out retweets but we will know which words come from a retweeted tweet and who was retweeted thanks to the columns isRetweet and retweet_from. The following is a breakdown of the steps performed by the chunk below:

  • Take the all_tweets data frame;
  • filter(!duplicated(all_tweets)) to filter duplicated tweets;
  • mutate(text = str_replace_all(text, replace_reg, "")) will clean up the text from all unnecessary signs and language elements that are not meaningful text (summarised in the regular expression replace_reg);
  • unnest_tokens(word, text, token = "regex", pattern = unnest_reg) is the key part of this step and this is when the 💫 tidy text ✨ will happen!! Here we used a regular expression as our token. This is a very useful trick that we borrowed from Text mining with R 📕 and allows us to keep # and @ from Twitter text.
  • We then remove stop words using the stop_words data frame provided with tidytext 📦 and, for any retweet, we filter out the word that is the handle of the retweeted user (the last filter() step). Keeping these words would overestimate the number of times that a user was mentioned, since the handle wasn’t part of the actual tweet but appears in retweeted tweets by default.
> replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
> unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
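> all_words <- all_tweets %>%
+   filter(!duplicated(all_tweets)) %>% # remove duplicated tweets
+   mutate(text = str_replace_all(text, replace_reg, "")) %>% # strip links, HTML entities and RT
+   unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>% # one token per row, keeping # and @
+   filter(!word %in% stop_words$word,str_detect(word, "[a-z]")) %>% # remove stop words and tokens without letters
+   filter(is.na(retweet_from) | word != str_to_lower(retweet_from)) # drop the retweeted handle; is.na() keeps original tweets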
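
To see what the two regular expressions do, they can be applied to a single made-up tweet with base R (perl = TRUE is needed for the look-ahead in unnest_reg). This only approximates what unnest_tokens does internally and skips the stop word step. It also shows a point to keep in mind when comparing word with retweet_from: retweet_from is NA for original tweets, and a comparison with NA returns NA rather than TRUE or FALSE, so the NA case has to be handled explicitly.

```r
replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

# Made-up retweet used for illustration
tweet <- "RT @chrisshipitv: Harry &amp; Meghan's big day! #RoyalWedding https://t.co/abc123"
retweet_from <- "@chrisshipitv"

# 1. remove links, HTML entities and the RT marker
clean <- gsub(replace_reg, "", tweet, perl = TRUE)

# 2. lower-case, then split on every character that cannot be part of
#    a word, hashtag or handle (apostrophes inside words are kept)
tokens <- strsplit(tolower(clean), unnest_reg, perl = TRUE)[[1]]
tokens <- tokens[nzchar(tokens)]  # drop the empty strings left between separators
tokens

# 3. drop the retweeted handle; is.na() keeps the tokens of original tweets,
#    for which retweet_from is NA
keep <- is.na(retweet_from) | tokens != tolower(retweet_from)
tokens[keep]
```

After step 2 the tokens are "@chrisshipitv", "harry", "meghan's", "big", "day" and "#royalwedding"; step 3 removes only the handle.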

Save dataset

> dir.create("Twitter_Tutorial_data", showWarnings = FALSE)
> write_csv(all_words, path = "Twitter_Tutorial_data/combined_royal_time_series.csv")

Hashtag frequency

At the time when we set up our cronjob, it wasn’t clear which hashtag would be used to refer to the Royal Wedding. So first let’s investigate which hashtag of the ones we searched for was the most popular.

> our_hash <- all_words %>% mutate(OurTweets = word %in% str_to_lower(Hashtags$Hashtags)) %>% 
+     filter(OurTweets) %>% group_by(word) %>% count() %>% ungroup() %>% mutate(word = reorder(word, 
+     n)) %>% ggplot(aes(x = word, y = n)) + geom_bar(stat = "identity") + coord_flip() + 
+     ggtitle("Our hashtags") + theme_bw()
> 
> our_hash

Since hashtags are used to associate tweets with a topic/trend/theme, it can be interesting to investigate which other hashtags co-occurred with ours. Here we find all other hashtags that were used more than 80 times. We will plot these in a wordcloud using the wordcloud2 📦. A wordcloud gives greater prominence to words that appear more frequently in the source text by making them appear larger.

> library(wordcloud2)
> 
> all_hash <- all_words %>% filter(!(word %in% str_to_lower(Hashtags$Hashtags))) %>% 
+     filter(str_detect(string = word, pattern = "#")) %>% group_by(word) %>% 
+     count() %>% filter(n > 80) %>% ungroup() %>% mutate(word = reorder(word, 
+     n)) %>% arrange(desc(word)) %>% rename(freq = n)
> 
> wordcloud2(all_hash)
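
The table handed to wordcloud2 is simply a data frame with a word column and a freq column. As a rough illustration of the counting step, the base R sketch below builds such a table from a handful of made-up tokens; the threshold of 1 replaces the 80 used above only because the example is tiny.

```r
# Made-up tokens standing in for all_words$word
words <- c("#win", "#competition", "#win", "harry", "#love", "#win", "#competition")

# Keep only hashtags and count how often each one occurs
hashtags <- words[grepl("#", words, fixed = TRUE)]
counts <- table(hashtags)

# Data frame with a word column and a freq column, most frequent first
hash_freq <- data.frame(word = names(counts), freq = as.vector(counts),
                        stringsAsFactors = FALSE)
hash_freq <- hash_freq[order(-hash_freq$freq), ]

# Keep hashtags used more than once
frequent <- hash_freq[hash_freq$freq > 1, ]
frequent
```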

We can also make a wordcloud in the shape of a crown 👑. However, for this to look good we also need to include words with lower frequency.

> # make wordcloud with more words and in the shape of a crown
> all_hash_more <- all_words %>% filter(!(word %in% str_to_lower(Hashtags$Hashtags))) %>% 
+     filter(str_detect(string = word, pattern = "#")) %>% group_by(word) %>% 
+     count() %>% ungroup() %>% mutate(word = reorder(word, n)) %>% arrange(desc(word)) %>% 
+     rename(freq = n)
> 
> 
> crown_path <- "crown.jpeg"
> hw <- wordcloud2(all_hash_more, size = 1, figPath = crown_path)
> hw

Interestingly, we find a lot of hashtags referring to competitions, e.g. #win, #competition, #WinItWednesday. This may be an artefact of collecting data on Thursday mornings Australian time, when it is typically still Wednesday in Europe and the Americas: lots of companies try to entice users to retweet their content with giveaway competitions such as #WinItWednesday.

Time series of Royal 👑 Hashtags

> library(plotly)
> 
> # Analyse only retweets
> commonTweetedHash <- all_words %>%
+   separate(created,into=c("Day","Time"),remove=FALSE,sep=" ") %>%
+   filter(str_detect(string = word,pattern = "#")) %>% 
+   filter(!is.na(retweet_from)) %>% # keep only retweets
+   group_by(Day,word) %>%
+   count() %>%
+   ungroup() %>%
+   group_by(word) %>%
+   summarise(MaxOneDay=max(n), # max number of mentions in one single day
+             SumMentions=sum(n)) %>% # sum of all the mentions of this hashtag
+   arrange(desc(MaxOneDay))
> kable(head(commonTweetedHash))
word              MaxOneDay   SumMentions
#royalwedding          3751         33131
#competition            963          1610
#win                    963          1893
#champagne              482           519
#winitwednesday         477           644
#champ                  474           609
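
The two summary columns can be checked by hand on a small made-up example. The base R sketch below counts mentions per day and hashtag and then takes, for every hashtag, the busiest single day (MaxOneDay) and the total over all days (SumMentions). The data are invented and the code is an illustration, not the pipeline used above.

```r
# Made-up hashtag mentions: one row per mention
mentions <- data.frame(
  Day  = c("2018-05-18", "2018-05-19", "2018-05-19", "2018-05-19", "2018-05-18", "2018-05-19"),
  word = c("#royalwedding", "#royalwedding", "#royalwedding", "#royalwedding", "#win", "#win"),
  stringsAsFactors = FALSE
)

# Number of mentions per day and hashtag
per_day <- as.data.frame(table(Day = mentions$Day, word = mentions$word),
                         responseName = "n", stringsAsFactors = FALSE)

# Per hashtag: the busiest single day and the total over all days
max_day <- tapply(per_day$n, per_day$word, max)
sum_all <- tapply(per_day$n, per_day$word, sum)
summary_hash <- data.frame(word = names(max_day),
                           MaxOneDay = as.vector(max_day),
                           SumMentions = as.vector(sum_all[names(max_day)]),
                           stringsAsFactors = FALSE)

# Sort by the busiest day, as in the table above
summary_hash <- summary_hash[order(-summary_hash$MaxOneDay), ]
summary_hash
```

Here #royalwedding has one mention on the 18th and three on the 19th, so MaxOneDay is 3 and SumMentions is 4.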

There are around 5K hashtags! We cannot follow them all across time.

Let’s do some hashtag filtering… We will remove our search hashtags and the super popular ones (the top 10 of commonTweetedHash, the start of which is shown in the table above)! Out of the remaining ones we will keep the most popular.

> tweetsDay_medium_popular <- all_words %>% 
+   filter(!is.na(retweet_from)) %>%
+   separate(created,into=c("Day","Time"),remove=FALSE,sep=" ") %>%
+   filter(str_detect(string = word,pattern = "#")) %>% # only keep hashtags
+   filter(!(word %in% Hashtags$Hashtags) &  # remove our search hashtags
+            !(word %in% tolower(Hashtags$Hashtags)) & 
+            !(word %in% commonTweetedHash$word[1:10])) %>%
+   group_by(Day,word) %>% 
+   count() %>%
+   arrange(desc(n))
> 
> # Only plot trends for medium popular hashtags
> tweetsDay <- all_words %>% 
+   separate(created,into=c("Day","Time"),remove=FALSE,sep=" ") %>%
+   filter(str_detect(string = word,pattern = "#")) %>% # only keep hashtags
+   filter(word %in% tweetsDay_medium_popular$word[1:10]) %>%
+   group_by(Day,word) %>%
+   count() 
> 
> plot_tweetsDay <- ggplot(tweetsDay,aes(x=Day,y=n,colour=word,group=word)) + geom_line() + theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
+   geom_vline(xintercept = c(36), linetype = "dotted")
> plot_tweetsDay
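
A note on the hard-coded xintercept: Day is a character column, so ggplot2 treats the x axis as discrete and a numeric xintercept refers to the position of a category along that axis. Instead of counting positions by hand, the position of a given day can be looked up. The base R sketch below uses made-up Day values; wedding_pos is a name introduced for this example.

```r
# Made-up Day values standing in for tweetsDay$Day
days <- c("2018-05-17", "2018-05-19", "2018-05-18", "2018-05-19", "2018-05-20")

# A discrete axis orders character values alphabetically, which for
# ISO dates (YYYY-MM-DD) is also chronological order
day_levels <- sort(unique(days))

# Position of a chosen day on the axis
wedding_pos <- match("2018-05-19", day_levels)
wedding_pos
```

The result could then be used as geom_vline(xintercept = wedding_pos, linetype = "dotted"), which keeps the line in place if days are added or removed.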

> p1 <- plot_tweetsDay + theme(legend.position = "none")
> ggplotly(p1)
  • Peak of tweets for #royalwedding on the 20th of May

  • #chogm2018: Commonwealth Heads of Government Meeting 2018 16-19 April 2018

> chogm <- all_words[all_words$word %in% "#chogm2018", ]
> dim(chogm)
## [1] 116  18
> sort(table(chogm$retweet_from))
## 
## @AlexDEMitchell      @hannarrr_   @PScotlandCSG  @mmarklefancom 
##               1               1               1               2 
##        @The_ACU         @PlanUK    @RoyalDickie   @RE_DailyMail 
##               2               3               3             103
> chogm[chogm$retweet_from %in% "@RE_DailyMail", c("tweetID")]
## # A tibble: 103 x 1
##    tweetID
##      <int>
##  1   17515
##  2   17041
##  3   16333
##  4   16332
##  5   16328
##  6   16327
##  7   16326
##  8   16323
##  9   16321
## 10   16319
## # ... with 93 more rows
> all_tweets$text[all_tweets$tweetID %in% c("15052", "15054", "15119")]
## [1] "RT @RE_DailyMail: #PrinceHarry and #MeghanMarkle have arrived to speak to inspirational Commonwealth Youth leaders #CHOGM2018 https://t.co/…"
## [2] "RT @RE_DailyMail: Video: #PrinceHarry and #MeghanMarkle meet delegates at today’s Commonwealth Youth Forum event #CHOGM2018 https://t.co/oH…"
## [3] "RT @RE_DailyMail: #PrinceHarry and #MeghanMarkle have arrived to speak to inspirational Commonwealth Youth leaders #CHOGM2018 https://t.co/…"